Github Repository

Github Page

Gender Effect of Financial Analyst Forecast During Covid-19

Qinzheng Xu

12/15/2022

stock-market-modern-flat-concept-for-web-banner-design-woman-is-engaged-in-trading-analyzes-financial-data-with-growth-trend-and-bargain-invests-money-illustration-with-isolated-people-scene-vector.jpg

Agenda

  • Introduction

  • Data and Summary Statistics

    • I. Data Sources

    • II. Examining The Analyst Forecast Data

    • III. Examining The Covid-19 Data

    • IV. Merge Two Analyst Forecast Data and Covid-19 Data

  • Measure and Variable Defination

  • Exploratory Data Analysis

    • I. Pessimism

    • II. Herding

    • III. Updating Frequency

    • IV. Rounding

    • V. Bold_d

    • VI. Reissue

  • Empirical Results

    • I. Linear Regression Gender Effect

      • Pessimism Results

      • Herding Results

      • Updating_Frequency Results

      • Bold_d Results

      • Rounding Results

      • Reissue Results

    • II. Stock Market Reaction Gender Effect

      • KNN fit with Gender

      • KNN fit without Gender

  • Conclusion

Introduction

Back

Covid-19 has exerted a tremendous impact on society both in the economy and social lifestyle. Among those impacts introduced by Covid-19, the female and male heterogeneous effect of Covid has been focused on both from the industry side and academic side. Lots of media reports that females are suffering more loss and losing bargaining power in the workplace. However, is covid-19 introduced heterogeneous loss towards males and females that could influence the work quality of females? It may not be true. This research aims to provide evidence of whether females and males behave differently during the covid-19 period by focusing on financial analyst jobs.

Why does this research focus on financial analysts? the rationality lies in two reasons. First, Financial Analyst issues their forecast for earnings announcement almost every day, providing us with good resources of information to track their work quality. Second, it is widely documented in the literature that financial analysts and their forecasts are important sources of information for investors. Analysts provide both forward-looking information innovation and analyze information already released to the market, thereby bridging the information gaps between the public firms and the investors, leveling the playing ground among investors, reducing overall information asymmetry, and enhancing market efficiency. Therefore, financial analysts' forecast not only enables us to measure financial analyst work quality but also have significant economic significance in helping investors make investment decisions.

This project focuses on female financial analysts during the covid-19 period, investigates the impact of the Covid-19 pandemic on female forecast behaviors compared with males, and research whether female-issued forecasts influence investors making decisions for the earnings announcement measured through post-earning announcement 10 days returns.

Data

Back

I. Data Sources

Back

Institutional Brokers' Estimate System (I/B/E/S)

The Institutional Brokers' Estimate System, or IBES is a database of analyst estimates and company guidance for more than 23,000 public companies. The database aggregates all of the available financial data on companies and company sectors to aid in decision-making. It features a host of data from equity analyst consensus to forward guidance. Historical data is available from 1976 when IBES was introduced, with international data going back to 1987.

I build up the sample by first obtaining analyst forecast data from Thomson Reuters Institutional Brokers Estimate System (I/B/E/S) starting from March 2019, one year before the shock of the Covid-19 pandemic officially ascertained by the WHO, through November 2021, roughly keeping even sample periods before and after the pandemic shock. Specifically, I use data files of I/B/E/S detail history (unadjusted) for quarterly analyst forecasts as measured by earnings per share (EPS) for US companies. In particular, I only select analyst forecasts whose forecast period indicator (FPI) from I/B/E/S is either 6 or 7, namely forecasts only for the current and the next fiscal quarter to ensure the timeliness of the forecast-related measures. Furthermore, I adjust all estimate and earnings announcement dates to the closest preceding trading date in CRSP to match the corresponding adjustment factors. Then the estimates are adjusted by CRSP adjustment factors to ensure the same per-share basis as the company reported EPS.

Covid-19 Data in the United States

The covid-19 case data is from New York Times Covid-19 database. The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. This compiled panel data is from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak. Since the first reported coronavirus case in Washington State on Jan. 21, 2020, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.

The data has been used to power New York Times maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists, and government officials who would like access to the data to better understand the outbreak.

The reason why I use Covid-19 data is to replace the covid-19 post dummy variable with real covid-19 7 days average new cases number to check the robustness of my results for financial analyst behaviors.

II. Examining The Analyst Forecast Data

Back

To carry on the analyst forecast analysis, I need to prepare the forecast data set and covid-19 dataset correctly and then merge them. This section will first go through the forecast dataset and check whether there are any outliers or missing data in each variable. Then I will merge the analyst data and covid-19 dataset by two keys: state and Year_month. Since the state in the analyst forecast data means the analyst living location, I need to further check whether the analyst is clustering at certain states. In the second step, I will check covid-19 data with two graphs to see whether the covid-19 data is following our intuitions about covid-19 spreading in the US.

Forecast Data Summary Statistics

In [1]:
import pandas as pd
import numpy as np
#pd.set_option('display.max_rows', None)

data1 = pd.read_csv("ana1.csv")
data2 = pd.read_csv("ana2.csv")
data3 = pd.read_csv("ana3.csv")
data4 = pd.read_csv("ana4.csv")
data5 = pd.read_csv("ana5.csv")
data6 = pd.read_csv("ana6.csv")
data7 = pd.read_csv("ana7.csv")

data = pd.concat([data1, data2])
data = pd.concat([data, data3])
data = pd.concat([data, data4])
data = pd.concat([data, data5])
data = pd.concat([data, data6])
data = pd.concat([data, data7])

# View All Columns in dataset.
print(data.columns[1:])
Index(['Unnamed: 0', 'AMASKCD', 'ANALYST', 'ESTIMID', 'TICKER', 'ESTIMATOR',
       'ANALYS', 'VALUE', 'FPEDATS', 'REVDATS', 'REVTIMS', 'ANNDATS',
       'ANNTIMS', 'permno', 'basis', 'repdats', 'act', 'new_value',
       'Top_Broker', 'fqtr', 'EXPERIENCE', 'EXPWITHFIRM', 'size', 'Leverage',
       'ROA', 'Cash_holding', 'RD', 'Total_asset', 'BM', 'INSTOWN', 'accrual',
       'EARNGROWTH', 'ANNDATS2', 'REVDATS2', 'repdats2', 'Year_Quarter',
       'Year_Month', 'SIC4', 'gender', 'city', 'state_code', 'Zipcode'],
      dtype='object')

As we could see in the above results, there are in total 41 variables in the analyst dataset. In this section, I will exhibit the summary statistics and introduce definitions of the main variable.

  • AMASKCD:The analyst mask ID in the IBES dataset.

  • ESTIMID:The analyst working brokage in the IBES dataset.

  • TICKER:The analyst forecasted firm ticker.

  • act: is real earning announcement value released on the firm earning announcement day.

  • new_value: is analyst forecast earning announcement value released before the firm earning announcement day.

  • Top_Broker: is dummy variable to label whether the analyst is working on famous broker in the US.

  • fqtr: firm fiscal quarter.

  • EXPERIENCE: current forecast issuing year minus the year of analyst first release forecast.

  • EXPWITHFIRM: current forecast issuing year minus the year of analyst first release forecast for a specific firm.

  • Size: firm foundamental. Calculated with log(Stock Price * shares)

  • Leverage: firm leverage ratio.

  • ROA: firm return on asset ratio.

  • Cash_holding: firm cash holding.

  • RD: firm research and development expenditure.

  • Total_asset: firm total Assets, including long term assets and short term assets.

  • BM: firm book to market ratio.

  • INSTOWN: firm institutional share holding ratio.

  • accrual: firm accrual.

  • EARNGROWTH: firm earning growth, calculated by (current earnings - last year same fiscal quarter earnings)/last year same fiscal quarter earnings.

  • ANNDATS2: analyst forecast released date.

  • REVDATS2: analyst forecast revised date.

  • repdats2: earning announcement date.

  • Year_Quarter: analyst forecast year quarter.

  • Year_Month: analyst forecast year month.

  • SIC4: firm industry code.

  • gender: analyst gender.

  • city: analyst living city.

  • state_code: analyst living state.

  • Zipcode: analyst living zipcode.

In [2]:
# Generate Summary Statistics.
data[['act', 'new_value', 'Top_Broker', 'fqtr',
       'EXPERIENCE', 'EXPWITHFIRM', 'size', 'Leverage', 'ROA', 'Cash_holding',
       'RD', 'Total_asset', 'BM', 'INSTOWN', 'accrual', 'EARNGROWTH',
       'ANNDATS2', 'REVDATS2', 'repdats2', 'Year_Quarter', 'Year_Month',
       'SIC4', 'gender']].describe().T
Out[2]:
count mean std min 25% 50% 75% max
act 725112.0 8.348247e-01 2.579519 -2.174900e+02 3.000000e-02 5.300000e-01 1.270000e+00 1.577800e+02
new_value 725112.0 7.157306e-01 1.186827 -2.170000e+00 4.000000e-02 4.800000e-01 1.150000e+00 5.030000e+00
Top_Broker 725112.0 2.577409e-01 0.437391 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 1.000000e+00
fqtr 725112.0 2.497406e+00 1.109664 1.000000e+00 2.000000e+00 2.000000e+00 3.000000e+00 8.000000e+00
EXPERIENCE 725112.0 1.372030e+01 8.865287 0.000000e+00 7.000000e+00 1.200000e+01 1.900000e+01 4.000000e+01
EXPWITHFIRM 725112.0 5.302115e+00 5.332798 -1.000000e+00 1.000000e+00 4.000000e+00 8.000000e+00 3.700000e+01
size 713746.0 8.532651e+00 1.943188 -3.147107e-01 7.262915e+00 8.536149e+00 9.884393e+00 1.465897e+01
Leverage 663543.0 3.850836e-01 40.032064 -2.951367e+03 2.713711e-01 7.262438e-01 1.491229e+00 9.731578e+02
ROA 714079.0 -4.052547e-04 0.023202 -1.105349e+01 -4.650750e-05 2.534570e-05 1.360850e-04 7.376654e-02
Cash_holding 656937.0 1.432306e-01 0.177776 0.000000e+00 2.925610e-02 8.352428e-02 1.775876e-01 9.995069e-01
RD 714463.0 1.264210e-02 0.037436 -2.631101e-01 0.000000e+00 0.000000e+00 1.174265e-02 4.905942e+00
Total_asset 714322.0 4.412519e+04 203511.230850 1.960000e-01 1.597763e+03 5.906564e+03 2.160970e+04 3.757576e+06
BM 713624.0 5.893034e-01 1.593991 -2.184231e+02 1.532555e-01 3.652840e-01 7.641880e-01 7.051203e+01
INSTOWN 714374.0 7.524225e-01 0.245178 2.827255e-07 6.469856e-01 8.122404e-01 9.186487e-01 1.530762e+01
accrual 711789.0 -9.209817e-02 1.117691 -1.979402e+02 -7.639224e-02 -2.578883e-02 -1.885338e-03 1.734346e+01
EARNGROWTH 686999.0 2.417483e-05 0.027642 -1.102792e+01 -5.415531e-05 1.124101e-06 6.143918e-05 3.312084e+00
repdats2 725112.0 2.020353e+07 8950.598631 2.019011e+07 2.020021e+07 2.020102e+07 2.021061e+07 2.022062e+07
Year_Quarter 725112.0 2.020025e+05 78.872093 2.019010e+05 2.019040e+05 2.020020e+05 2.021010e+05 2.021040e+05
Year_Month 725112.0 2.020062e+05 78.912889 2.019010e+05 2.019100e+05 2.020050e+05 2.021020e+05 2.021120e+05
SIC4 719962.0 5.710675e+03 2680.208923 1.700000e+02 3.639000e+03 5.812000e+03 7.372000e+03 9.999000e+03
gender 725112.0 9.597276e-02 0.294554 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00

As we can see from the above table, the act variable value is within the range of -214 to 157, which is huge before we adjust with the stock split and dividends. The new_value variable is the analyst forecast earning per share between -2.17 to 5.03, which is consistent with our normal intuition of earning per share range. All the other variables are within the normal intuitive range given their definition.

Analyst Location Distribution

In [3]:
import plotly.express as px

d = data[['AMASKCD', 'state_code']].drop_duplicates()
d = d.groupby('state_code').count().reset_index()
fig2 = px.choropleth(d,
                    locations='state_code', 
                    locationmode="USA-states", 
                    scope="usa",
                    color='AMASKCD',
                    color_continuous_scale='Viridis_r')

fig2.update_layout(
      title_text = 'Analyst Location Distribution',
      title_font_family="Times New Roman",
      title_font_size = 22,
      title_font_color="black", 
      title_x=0.45)

As the above figure indicates, most analyst is located in New York State, because New York City has lots of financial institutions. This will give our first research topic a big problem, since our results may drive by a New York-based analyst. To address that, I will introduce a refined test model to check whether New York is the main driven force of our results.

III. Examining The Covid-19 Data

Back

In this section, I will first present the summary statistics of covid-19 data, then I will introduce two figures about the Covid-19 outbreak in the US to check whether the Covid-19 outbreak data is aline with our intuition. I will first lay out the covid-19 cases data on the date of 2020-03-01, which is just the beginning of Covid-19 in the US. In the second graph, I will lay out the Covid-19 cases data on the date 2021-11-25.

In [4]:
covid = pd.read_excel("covid_cases.xlsx")
covid[['fips', 'cases', 'deaths', 'new_cases', 'avg_new_cases', 'avg_new_cases_r']].describe().T
Out[4]:
count mean std min 25% 50% 75% max
fips 39816.0 32.535714 18.905041 1.0 17.750000 31.500000 46.250000 78.0
cases 39816.0 384163.296112 670592.798311 0.0 12456.500000 124343.000000 476375.500000 5892644.0
deaths 39816.0 6916.022001 11643.375990 0.0 264.000000 2186.500000 8282.000000 77042.0
new_cases 39816.0 1469.386352 3947.989144 -40527.0 30.000000 378.000000 1360.250000 193786.0
avg_new_cases 39816.0 1414.094922 3031.222743 -4502.0 95.285714 481.714286 1457.642857 69956.0
avg_new_cases_r 39816.0 0.018572 0.136090 -1.0 -0.024179 0.000000 0.039257 1.0
  • fips: US states code.

  • cases: Covid-19 cases number.

  • deaths: Covid-19 death number.

  • new_cases: Covid-19 new cases number.

  • avg_new_cases: Covid-19 new cases past 7 days average number.

  • avg_new_cases_r: Covid-19 new cases growth rate past 7 days average number.

The above statistics table listed the variable definition and variable value range in the dataset. We could see that the cases and death numbers have a huge range since the dataset recorded covid-19 cases and death numbers from the beginning. For the daily new cases number and average new cases number, since the 7-day average new cases number smoothing the number of the new cases. Therefore, we could see the standard deviation of new cases is smaller than the average number of new cases.

In [5]:
fig1 = px.choropleth(covid[covid['date'] == '2020-03-01'],
                    locations='state code', 
                    locationmode="USA-states", 
                    scope="usa",
                    color='cases',
                    color_continuous_scale="Viridis_r")

fig1.update_layout(
      title_text = 'Covid 19 Cases on 2020-03-01',
      title_font_family="Times New Roman",
      title_font_size = 22,
      title_font_color="black", 
      title_x=0.45)

As the above figure indicates, there is a very small value in the cases when covid-19 cases first outbreak in the US on 2020-03-01.

In [6]:
fig2 = px.choropleth(covid[covid['date'] == '2021-11-25'],
                    locations='state code', 
                    locationmode="USA-states", 
                    scope="usa",
                    color='cases',
                    color_continuous_scale='Viridis_r')

fig2.update_layout(
      title_text = 'Covid 19 Cases on 2021-11-25',
      title_font_family="Times New Roman",
      title_font_size = 22,
      title_font_color="black", 
      title_x=0.45)

As the above figure indicates, there are above million covid-19 cases across each state on the date 2021-11-25. Both graphs are quite following our intuition about the covid-19 spreading across the US. As time passes by, covid-19 accumulated cases will go up across all the states.

IV. Merge Two Analyst Forecast Data and Covid-19 Data

Back

In this section, I will perform the code of merging the analyst forecast data and the Covid-19 dataset based on time and state_code.

In [7]:
covid['date'] = pd.to_datetime(covid['date'])
covid = covid.rename(columns = {"date":"ANNDATS2",
                                "state code":"state_code"})
data['ANNDATS2'] = pd.to_datetime(data['ANNDATS2'])
data = pd.merge(data, covid[['ANNDATS2', 'state_code', 'cases', 'deaths',
       'new_cases', 'avg_new_cases', 'avg_new_cases_r']], how = "left", on = ["ANNDATS2", "state_code"])
data[['cases', 'deaths', 'new_cases', 'avg_new_cases', 'avg_new_cases_r']] = data[['cases', 'deaths',
       'new_cases', 'avg_new_cases', 'avg_new_cases_r']].fillna(0)
# Dummy return.
data['rnewcase_pos_d'] = np.int64(data['avg_new_cases_r'] > 0)
data['rnewcase_neg_d'] = np.int64(data['avg_new_cases_r'] < 0)

Measure

Back

In this section, I will calculate several important variables that proxy the behavior of financial analysts based on their earning announcement earning per share forecast. These variables include Pessimism, COVERAGESIZE, Herding, Updating Frequency, Bold_pos, Bold_neg, Bold_d, Rounding, Reissue, Forecast Age, and Rec_Chg. The definition of these each variable will be listed below:

  • Pessimism: is dummy variable if current forecast blow the 180 days consensus forecast. The consensus forecast is equal to the average value of the forecasts issued by all analysts during the same 180-day period.
  • Updating Frequency: defined as the number of forecasts issued by analyst i in month t for firm j.
  • COVERAGESIZE: Number of firms an analyst covers in the prior period.
  • Herding: is a dummy variable that takes the value of one for forecasts that are between the analyst's own prior forecast and the consensus forecast, and zero otherwise.
  • Bold_pos: Forecasts as bold if they are above both the analyst’s own prior forecast and the consensus forecast immediately prior to the analyst’s forecast. Consensus Forecast is 180 days moving average forecast by time of analyst i Issue forecast at time t.
  • Bold_neg: Forecasts as bold if they are below both the analyst’s own prior forecast and the consensus forecast immediately prior to the analyst’s forecast. Consensus Forecast is 180 days moving average forecast by time of analyst i Issue forecast at time t.
  • Bold_d: Forecasts as bold if they are above both the analyst’s own prior forecast and the consensus forecast immediately prior to the analyst’s forecast, or else below both. Consensus Forecast is 180 days moving average forecast by time of analyst i Issue forecast at time t.
  • Rounding: is a dummy variable that takes the value of one if a forecast ends with zero or five in the penny digit, and zero otherwise.
  • Reissue: is a dummy variable that takes the value of one if a forecast is reissued, and zero otherwise.
  • Forecast Age: is the natural logarithm of the number of days from the forecast to the earnings announcement date.
  • Rec_Chg: A trinary variable equal to one if analyst i upgrades his/her previously outstanding buy/hold/sell recommendation of firm j’s earnings announcement, zero if there is no change, and negative one for downgrades.
In [8]:
# 180 Day Consensus and Pessimism.
data = data.sort_values(["repdats2", 'permno', 'ANNDATS2', 'ANNTIMS']).reset_index(drop=True)
ll2 = data.groupby(["repdats2", 'permno']).rolling('180D', min_periods=1, on = "ANNDATS2")['new_value'].mean().reset_index()
ll2 = ll2.rename(columns={"new_value": "acc_avg180"})
error2 = data.groupby(["repdats2", 'permno']).rolling('180D', min_periods=1, on = "ANNDATS2")['new_value'].std().reset_index()
error2 = error2.rename(columns={"new_value": "acc_std180"})
data['shift_new_value'] = data.groupby(['AMASKCD', 'permno', 'repdats2'])['new_value'].shift()
data['acc_avg180'] = ll2['acc_avg180']
data['acc_std180'] = error2['acc_std180']
data['Pessimism_d180'] = np.int64(data['new_value'] < data['acc_avg180'])

# Firm Covered.
temp = data[['AMASKCD', 'permno', 'Year_Month']].sort_values(by = ['AMASKCD', 'Year_Month', 'permno'])
temp = temp.drop_duplicates()
num_firm = temp.groupby(['AMASKCD', 'Year_Month']).count().reset_index()
num_firm = num_firm.rename(columns={'permno': 'Firm Covered'})

# COVERAGESIZE
lag_num_firm = num_firm.groupby(["AMASKCD"]).shift()
lag_num_firm = lag_num_firm.rename(columns = {"Firm Covered":"L_Firm_Covered"})
num_firm['COVERAGESIZE'] = lag_num_firm["L_Firm_Covered"]
data = pd.merge(data, num_firm, on = ['AMASKCD', 'Year_Month'], how = 'left')

# Forecast_Number.
temp = data[['AMASKCD', 'permno', "repdats", 'Year_Month']].sort_values(by = ['AMASKCD', 'Year_Month', 'permno', "repdats"])
temp = temp.drop_duplicates()
num_ea = temp.groupby(['AMASKCD', 'Year_Month']).count().reset_index()
num_ea = num_ea.rename(columns={'permno': 'Forecast_Number'})
data = pd.merge(data, num_ea[['AMASKCD', 'Forecast_Number', 'Year_Month']], on = ['AMASKCD', 'Year_Month'], how = 'left')

# Updating Frequency.
temp2 = data[['AMASKCD', 'new_value', 'Year_Month']].sort_values(by = ['AMASKCD', 'Year_Month', 'new_value'])
up_freq = temp2 .groupby(['AMASKCD', 'Year_Month']).count().reset_index()
up_freq = up_freq.rename(columns={'new_value': 'Updating Frequency'})
data = pd.merge(data, up_freq, on = ['AMASKCD', 'Year_Month'], how = 'left')

# DIstinct SIC code number in pre-period.
temp = data[['AMASKCD', 'SIC4', 'Year_Month']].sort_values(by = ['AMASKCD', 'Year_Month', 'SIC4'])
temp = temp.drop_duplicates()
num_SIC = temp.groupby(['AMASKCD', 'Year_Month']).count().reset_index()
num_SIC = num_SIC.rename(columns={'SIC4': 'SIC Covered'})
lag_num_SIC = num_SIC.groupby(["AMASKCD"]).shift()
lag_num_SIC = lag_num_SIC.rename(columns = {"SIC Covered":"L_SIC_Covered"})
num_SIC['COVERAGEFOCUS'] = lag_num_SIC["L_SIC_Covered"]
data = pd.merge(data, num_SIC, on = ['AMASKCD', 'Year_Month'], how = 'left')

# DIstinct Forecast number in pre-period.
temp = data[['AMASKCD', 'new_value', 'Year_Month']].sort_values(by = ['AMASKCD', 'Year_Month'])
num_fore = temp.groupby(['AMASKCD', 'Year_Month']).count().reset_index()
num_fore = num_fore.rename(columns={'new_value': 'Forecast Issued'})
lag_num_fore = num_fore.groupby(["AMASKCD"]).shift()
lag_num_fore = lag_num_fore.rename(columns = {"Forecast Issued":"L_Forecast_Issued"})
num_fore['FORECASTFREQ_LAG'] = lag_num_fore["L_Forecast_Issued"]
data = pd.merge(data, num_fore, on = ['AMASKCD', 'Year_Month'], how = 'left')

# Herding.
data = data.sort_values(by = ['AMASKCD', 'permno', 'repdats2', 'ANNDATS2', 'ANNTIMS']).reset_index(drop=True)
kk = data[['AMASKCD', 'permno','repdats2', 'ANNDATS2','ANNTIMS','new_value', 'acc_avg180']]
kk['shift'] = kk.groupby(['AMASKCD', 'permno', 'repdats2'])['new_value'].shift()
herding = np.int64((kk['new_value'] > kk['acc_avg180']) & (kk['new_value'] < kk['shift'])) + np.int64((kk['new_value'] < kk['acc_avg180']) & (kk['new_value'] > kk['shift']))
data['Herding'] = herding

# Bold_pos.
data["bold_pos"] = np.int64((data['new_value'] > data['acc_avg180']) & (data['new_value'] > data['shift_new_value']))

# Bold_neg.
data["bold_neg"] = np.int64((data['new_value'] < data['acc_avg180']) & (data['new_value'] < data['shift_new_value']))

# Bold_d.
data["bold_d"] = data["bold_neg"] + data["bold_pos"]

# Rounding.
data['Rounding'] = np.int64(data['new_value']*100%5 == 0)

# Reissue.
data['Reissue'] = np.int64(data['REVDATS2'] != data['ANNDATS2'])

# Forecast Age.
data['days'] = pd.DataFrame(pd.to_datetime(data['repdats']) - pd.to_datetime(data['ANNDATS2']))[0].dt.days
data['Forecast_Age'] = np.log(data['days']+1)
data = data[data.Forecast_Age != -np.inf]
data = data.replace(-np.inf, np.nan)
data = data.replace(np.inf, np.nan)

# Rec_Chg
data = data.sort_values(by = ['AMASKCD', 'permno', 'repdats2', 'ANNDATS2', 'ANNTIMS']).reset_index(drop=True)
kk = data[['AMASKCD', 'permno','repdats2', 'ANNDATS2','ANNTIMS','new_value', 'acc_avg180']].copy()
kk = kk.groupby(['AMASKCD', 'permno', 'repdats2'])["ANNDATS2", 'new_value'].shift()
data['ANNDATS2_lag'] = kk["ANNDATS2"].copy()
data['new_value_lag'] = kk['new_value'].copy()
data['Rec_Chg'] = np.int64(data['new_value'] > data['new_value_lag'])
data['Rec_Chg'] = data['Rec_Chg'] - np.int64(data['new_value'] < data['new_value_lag'])

data = data.rename(columns = {'Pessimism_d180':'Pessimism',
                              "bold_pos":"Bold_pos",
                             'bold_neg':"Bold_neg",
                             'bold_d':'Bold_d'})
data2 = data[['Year_Month', 'Pessimism', 'COVERAGESIZE', 'Herding', 'Updating Frequency',
      'Bold_pos', 'Bold_neg', 'Bold_d', 'Rounding', 'Reissue',
      'Forecast_Age', 'Rec_Chg', 'gender']].copy()
C:\Users\xu000\AppData\Local\Temp\ipykernel_10976\267675118.py:59: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

C:\Users\xu000\AppData\Local\Temp\ipykernel_10976\267675118.py:88: FutureWarning:

Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.

EDA

Back

In this section, I will present some preliminary results of the first research proposal. I will draw both male and female forecast measures across pre-covid to post-covid to check whether females behave differently compared with males. let's first standardize the time range.

In [9]:
import seaborn as se
import matplotlib.pyplot as plt
data2['year'] = np.int32(data2['Year_Month']/100)
data2['year'] = data2['year'].values.astype('str')
data2['month'] = data2['Year_Month']%100
data2['month'] = data2['month'].values.astype('str')
data2['YM'] = data2['year'] + '-' + data2['month']
data2['YM'] = pd.to_datetime(data2['YM'])
data2['gender'] = data2['gender'].map({0:'Male', 1:'Female'})

I. Pessimism

Back

In the pessimism graph below, we could see that the red line is the date 2020-03-01, which is well known as the outbreak of covid-19. Before the covid-19 outbreak, the pessimism of males and females is not very different, but after the break out of covid-19, female is more pessimism compared with a male as indicated in the graph.

In [10]:
se.set(rc={'figure.figsize':(20,8)})
P = data2[['YM','Pessimism','gender']].groupby(['YM', 'gender']).mean().reset_index()
se.lineplot(data = P, x = 'YM', y = "Pessimism", hue = 'gender')
plt.axvline('2020-03-01', 0,1, color = 'red')
Out[10]:
<matplotlib.lines.Line2D at 0x12109c888b0>

II. Herding

Back

Herding measures whether analysts tend to navigate their original earnings per share forecast toward the median forecast issued by another analyst. As the Herding figure indicates, before the covid-19 outbreak, females tend to herd more compared with males. This trend didn't change after the covid-19 outbreak. Whether female analysts heard more after the covid-19? this question needs more rigorous statistical tests.

In [11]:
H = data2[['YM','Herding','gender']].groupby(['YM', 'gender']).mean().reset_index()
se.lineplot(data = H, x = 'YM', y = "Herding", hue = 'gender')
plt.axvline('2020-03-01', 0, 1, color = 'red')
Out[11]:
<matplotlib.lines.Line2D at 0x12109b36af0>

III. Updating Frequency

Back

Updating Frequency measures how frequently the analyst issues new forecasts. As the updating frequency figure indicates, males tend to update more compared with female after the covid-19, but given male is updating more compared with a female before the covid-19, this variable still needs rigorous statistical tests.

In [12]:
H = data2[['YM','Updating Frequency','gender']].groupby(['YM', 'gender']).mean().reset_index()
se.lineplot(data = H, x = 'YM', y = "Updating Frequency", hue = 'gender')
plt.axvline('2020-03-01', 0,1, color = 'red')
Out[12]:
<matplotlib.lines.Line2D at 0x12109ffeeb0>

IV. Rounding

Back

Rounding measures whether analyst tends to manipulate their forecast with an ending digit of 0 or 5. As the following graph indicates, it's hard to detect that females have more rounding behavior than males before and after the covid-19.

In [13]:
H = data2[['YM','Rounding','gender']].groupby(['YM', 'gender']).mean().reset_index()
se.lineplot(data = H, x = 'YM', y = "Rounding", hue = 'gender')
plt.axvline('2020-03-01', 0,1, color = 'red')
Out[13]:
<matplotlib.lines.Line2D at 0x12110fb4f70>

V. Bold_d

Back

Bold_d measures whether the number of boldness forecasts issued by the analyst. More boldness forecast could provide investors with more information about the incoming earnings announcement. Boldness forecast is correlated with analyst career length and employment risk. In the below graph, the female analyst doesn't have a significant intention to the male analyst in issuing bold forecasts.

In [14]:
H = data2[['YM','Bold_d','gender']].groupby(['YM', 'gender']).mean().reset_index()
se.lineplot(data = H, x = 'YM', y = "Bold_d", hue = 'gender')
plt.axvline('2020-03-01', 0,1, color = 'red')
Out[14]:
<matplotlib.lines.Line2D at 0x1210731fa90>

VI. Reissue

Back

The reissue measures whether the analyst is less confident in their formal issued forecasts. The graph listed below don't have strong evidence in showing female is tending to issue more reissue forecast than male.

In [15]:
H = data2[['YM','Reissue','gender']].groupby(['YM', 'gender']).mean().reset_index()
se.lineplot(data = H, x = 'YM', y = "Reissue", hue = 'gender')
plt.axvline('2020-03-01', 0,1, color = 'red')
Out[15]:
<matplotlib.lines.Line2D at 0x12110f626d0>

Results

Back

I. Linear Regression Gender Effect

Back

Whether female analysts and male analysts have different forecast behaviors during the covid-19? Here I propose to use a Diff-in-Diff model. The diff in diff model is standard in economics to draw causal inferences. It compares the differences of treated group and control group before and after a shock to check whether the treated group is influenced by the shock. The treated group in my model is female analysts issued forecasts and the control group is the male analysts issued forecasts. The post dummy is posted covid-19. The dependent variable used here proxy for the analyst forecasts behavior, which is pessimism, herding, rounding, bold_d, and reissue. The following equation would be the formal Diff-in-Diff model that I will use:

$Y_{ijt} = D_t + T_i + \beta D_t*T_i + X_{it} + Time_t + Industry_j + e_{ijt}$

$Y_{ijt}$ is the forecast behavior variables that we want to test.

$D_t$ is the dummy variable, if post 03-01-2020 (post covid-19), then label 1. .

$T_i$ is the dummy variable if analyst is female then label 1.

$X_{it}$ is the analyst characteristics control and firm foundamental control.

$Time_t$ is the time fixed effect which is used for control the Macroeconomy variables.

$Industry_j$ is the industry fixed effects, which is used for control the persistent industry foundamental characteristics.

For the robustness test of whether the results is driven by the New York analyst, I will perform following Model:

$Y_{ijtc} = D_t + T_i + \beta D_t*T_i + \beta_2 D_t*T_i*New York_c + X_{it} + Time_t + Industry_j + e_{ijtc}$

I add a new control variable named $\beta_2 D_t*T_i*NewYork_c$. If the $\beta_2$ is significant and $ \beta$ is insignificant, then it indicates this results is driven by the New York analysts. If $ \beta$ is significant even after control the $\beta_2 D_t*T_i*NewYork_c$, then the results is not partially driven by the New York analyst.

In the following section, I will calculate the main variable of diff in diff model and describe the basic regression function that I use.

In [16]:
import statsmodels.api as sm
from patsy import dmatrices
from statsmodels.iolib.summary2 import summary_col

# Generate Fixed Effect and main variable
data["fqtrgvkey5"] = data["repdats2"]*100000 + data["permno"]
data["AMASKCD_permno"] = data["AMASKCD"]*100000 + data["permno"]

# Post Variable.
data['post'] = np.int32(data['ANNDATS2'] > '2020-03-01')

# Generate New York based analyst forecasts.
data['NY'] = np.int32(data['state_code'] == 'NY')

data['gender_post_NY'] = data['gender']*data['post']*data['NY']
data['gender_post'] = data['gender']*data['post']

data['gender_covid_NY'] = data['gender']*data['avg_new_cases']*data['NY']
data['gender_covid'] = data['gender']*data['avg_new_cases']

data['Year_Month'] = data['Year_Month'].astype(str)
data = data[data['SIC4'].isna() == False]
data['SIC2'] = np.int32(data['SIC4']/100)
data['SIC2'] = data['SIC2'].astype(str)
data['SIC4'] = data['SIC4'].astype(str)

# Rename some key variables.
data = data.rename(columns = {'Updating Frequency':'Updating_Frequency',
                             'Firm Covered':'Firm_Covered'})

# Drop duplicates. 
data = data[['AMASKCD', 'ANALYST', 'ESTIMID', 'TICKER',
       'ESTIMATOR', 'ANALYS', 'VALUE', 'FPEDATS', 'REVDATS', 'REVTIMS',
       'ANNDATS', 'ANNTIMS', 'permno', 'basis', 'repdats', 'act', 'new_value',
       'Top_Broker', 'fqtr', 'EXPERIENCE', 'EXPWITHFIRM', 'size', 'Leverage',
       'ROA', 'Cash_holding', 'RD', 'Total_asset', 'BM', 'INSTOWN', 'accrual',
       'EARNGROWTH', 'ANNDATS2', 'REVDATS2', 'repdats2', 'Year_Quarter',
       'Year_Month', 'SIC4', 'gender', 'city', 'state_code', 'Zipcode',
       'cases', 'deaths', 'new_cases', 'avg_new_cases', 'avg_new_cases_r',
       'rnewcase_pos_d', 'rnewcase_neg_d', 'shift_new_value', 'acc_avg180',
       'acc_std180', 'Pessimism', 'Firm_Covered', 'COVERAGESIZE',
       'Forecast_Number', 'Updating_Frequency', 'SIC Covered', 'COVERAGEFOCUS',
       'Forecast Issued', 'FORECASTFREQ_LAG', 'Herding', 'Bold_pos',
       'Bold_neg', 'Bold_d', 'Rounding', 'Reissue', 'days', 'Forecast_Age',
       'ANNDATS2_lag', 'new_value_lag', 'Rec_Chg', 'fqtrgvkey5',
       'AMASKCD_permno', 'post', 'NY', 'gender_post_NY', 'gender_post',
       'gender_covid_NY', 'gender_covid', 'SIC2']].drop_duplicates()

After cleaning the data and generate the basic variables used in Diff in Diff model, the following function will perform five regression analysis for each dependent variables that I will use. The five regression serves for reporting following information:

Regression 1: Basic diff in diff model without control variables and fixed effect. standard error is clustered at analyst and firm level to capture the correlated forecast residules within analyst and firm level.

Regression 2: DIff in DIff model with control variables. Standard error is clustered at analyst and firm level.

Regression 3: DIff in DIff model with control variables, time fixed effect and industry fixed effect to rule out omitted variable bias introduced from macro-economy cycle and industry constant trend across time.

Regression 4: DIff in DIff model with control variables, time fixed effect and industry fixed effect and control New York driven term to test whether the results is driven by New York based analyst forecasts.

Regression 5: DIff in DIff model with control variables, time fixed effect and industry fixed effect and control New York driven term. The only difference is to replace $Post$ dummy variable with Covid-19 past 7 days new cases number to capture the heterogeneous covid-19 impact pattern across time and states.

In [17]:
# Regression Model
def Regression1(data, dep):
    # Regression 1 without control.
    y, X = dmatrices(dep + ' ~ gender_post + gender + post', data=data, return_type='dataframe')
    mod = sm.OLS(y, X)
    res1 = mod.fit(cov_type='cluster', cov_kwds={'groups': data['AMASKCD_permno']}) 

    # Regression 2 add in controls.
    test1 = data[[dep,'gender_post','gender','post','Forecast_Age','Top_Broker','EXPERIENCE','EXPWITHFIRM','COVERAGESIZE','COVERAGEFOCUS','size','ROA','RD','Total_asset','BM','INSTOWN','EARNGROWTH','Cash_holding','Leverage','AMASKCD_permno']].dropna()
    equation1 = dep + ' ~ gender_post + gender + post + Forecast_Age +Top_Broker + EXPERIENCE + EXPWITHFIRM +COVERAGESIZE +COVERAGEFOCUS+ size + ROA + RD + Total_asset + BM + INSTOWN + EARNGROWTH + Cash_holding + Leverage'
    y, X = dmatrices(equation1, data=test1, return_type='dataframe')
    mod = sm.OLS(y, X)
    res2 = mod.fit(cov_type='cluster', cov_kwds={'groups': test1['AMASKCD_permno']}) 

    # Regression 3 add in Time and industry fixed effect.
    test2 = data[[dep,'gender_post','gender','post','Forecast_Age','Top_Broker','EXPERIENCE','EXPWITHFIRM','COVERAGESIZE','COVERAGEFOCUS','size','ROA','RD','Total_asset','BM','INSTOWN','EARNGROWTH','Cash_holding','Leverage','Year_Month','SIC2','AMASKCD_permno']].dropna()
    equation2 = dep + ' ~ gender_post + gender + post +Forecast_Age+ Top_Broker + EXPERIENCE + EXPWITHFIRM + COVERAGESIZE+COVERAGEFOCUS+size + ROA + RD + Total_asset + BM + INSTOWN + EARNGROWTH + Cash_holding + Leverage + Year_Month + SIC2'
    y, X = dmatrices(equation2, data=test2, return_type='dataframe')
    mod = sm.OLS(y, X)
    res3 = mod.fit(cov_type='cluster', cov_kwds={'groups': test2['AMASKCD_permno']}) 

    # Regression 4 add in NY driven term.
    test3 = data[[dep,'gender_post_NY','gender_post','gender','post','Forecast_Age','Top_Broker','EXPERIENCE','EXPWITHFIRM','COVERAGESIZE','COVERAGEFOCUS','size','ROA','RD','Total_asset','BM','INSTOWN','EARNGROWTH','Cash_holding','Leverage','Year_Month','SIC2','AMASKCD_permno']].dropna()
    equation3 = dep+' ~ gender_post + gender + post + gender_post_NY + Forecast_Age+ Top_Broker + EXPERIENCE + EXPWITHFIRM +COVERAGESIZE +COVERAGEFOCUS+size + ROA + RD + Total_asset + BM + INSTOWN + EARNGROWTH + Cash_holding + Leverage + Year_Month + SIC2'
    y, X = dmatrices(equation3, data=test3, return_type='dataframe')
    mod = sm.OLS(y, X)
    res4 = mod.fit(cov_type='cluster', cov_kwds={'groups': test3['AMASKCD_permno']})
    
    # Regression 5 change covid-19 dummy to 7 days moving average new cases.
    test4 = data[[dep,'avg_new_cases', 'rnewcase_pos_d', 'rnewcase_neg_d','gender_covid_NY','gender_covid','gender','post','Forecast_Age','Top_Broker','EXPERIENCE','EXPWITHFIRM','COVERAGESIZE','COVERAGEFOCUS','size','ROA','RD','Total_asset','BM','INSTOWN','EARNGROWTH','Cash_holding','Leverage','Year_Month','SIC2','AMASKCD_permno']].dropna()
    equation4 = dep+' ~ gender_covid + gender + avg_new_cases + gender_covid_NY + Forecast_Age+ Top_Broker + EXPERIENCE + EXPWITHFIRM +COVERAGESIZE +COVERAGEFOCUS+size + ROA + RD + Total_asset + BM + INSTOWN + EARNGROWTH + Cash_holding + Leverage + Year_Month + SIC2'
    y, X = dmatrices(equation4, data=test4, return_type='dataframe')
    mod = sm.OLS(y, X)
    res5 = mod.fit(cov_type='cluster', cov_kwds={'groups': test4['AMASKCD_permno']})

    # Report Results
    dfoutput = summary_col([res1,res2, res3, res4, res5],stars=True, 
                           regressor_order = ['gender_post', 'gender_covid','gender_pos_covid','gender_neg_covid', 'gender_post_NY', 'gender_covid_NY','gender_pos_covid_NY','gender_neg_covid_NY','gender', 'post','avg_new_cases', 'rnewcase_pos_d','rnewcase_neg_d','Forecast_Age','Top_Broker','EXPERIENCE','EXPWITHFIRM','COVERAGESIZE','COVERAGEFOCUS','size','ROA','RD','Total_asset','BM','INSTOWN','EARNGROWTH','Cash_holding','Leverage'],
                           model_names = [dep + '1', dep + '2', dep + '3', dep + '4', dep + '5'])
    INter = dfoutput.tables[0][dfoutput.tables[0].index == 'Intercept']
    r2 = dfoutput.tables[0][dfoutput.tables[0].index == 'R-squared']
    r2adj = dfoutput.tables[0][dfoutput.tables[0].index == 'R-squared Adj.']
    content = dfoutput.tables[0][0:44]
    return pd.concat([content, INter, r2, r2adj])

Pessimism Results

Back

The Pessimism results depicts whether female analyst tend to issue more pessimistic forecast compared with male after the Covid-19.

Regression 1: gender_post term is positive and significant, means female tend to issue more pessimistic forecast compared with male after the Covid-19.

Regression 2: gender_post term is positive and less significant, means tend to issue more pessimistic forecast compared with male after the Covid-19, but this relationship is weak.

Regression 3: gender_post term is not significant, I cannot find evidence that female is more or less pessimistic compared with male.

Regression 4: gender_post term is not significant, I cannot find evidence that female is more or less pessimistic compared with male. gender_post_NY is also not significant, means New York analyst is not the driven force.

Regression 5: gender_covid term is not significant, I cannot find evidence that female is more or less pessimistic compared with male. gender_covid_NY is also not significant, means New York analyst is not the driven force.

In [18]:
pessimism = Regression1(data, dep = 'Pessimism')
pessimism
Out[18]:
Pessimism1 Pessimism2 Pessimism3 Pessimism4 Pessimism5
gender_post 0.0167*** 0.0128* 0.0103 0.0065
(0.0061) (0.0068) (0.0067) (0.0102)
gender_covid -0.0000
(0.0000)
gender_post_NY 0.0052
(0.0106)
gender_covid_NY -0.0000
(0.0000)
gender 0.0167*** 0.0176*** 0.0145** 0.0145** 0.0221***
(0.0053) (0.0060) (0.0060) (0.0060) (0.0047)
post -0.0500*** -0.0601*** 0.1411*** 0.1412***
(0.0020) (0.0024) (0.0524) (0.0524)
avg_new_cases 0.0000
(0.0000)
Forecast_Age -0.0608*** -0.0573*** -0.0573*** -0.0573***
(0.0011) (0.0011) (0.0011) (0.0011)
Top_Broker -0.0030 -0.0035 -0.0036 -0.0037
(0.0032) (0.0031) (0.0031) (0.0031)
EXPERIENCE 0.0000 0.0005*** 0.0005*** 0.0005***
(0.0002) (0.0002) (0.0002) (0.0002)
EXPWITHFIRM 0.0006** -0.0001 -0.0001 -0.0001
(0.0003) (0.0003) (0.0003) (0.0003)
COVERAGESIZE 0.0015*** 0.0001 0.0001 0.0001
(0.0003) (0.0003) (0.0003) (0.0003)
COVERAGEFOCUS 0.0014*** -0.0003 -0.0003 -0.0003
(0.0005) (0.0005) (0.0005) (0.0005)
size -0.0008 0.0029*** 0.0029*** 0.0029***
(0.0008) (0.0008) (0.0008) (0.0008)
ROA 0.1070* 0.0948* 0.0950* 0.0942*
(0.0553) (0.0563) (0.0563) (0.0565)
RD -0.2073*** -0.2314*** -0.2315*** -0.2318***
(0.0356) (0.0386) (0.0386) (0.0386)
Total_asset -0.0000*** -0.0000*** -0.0000*** -0.0000***
(0.0000) (0.0000) (0.0000) (0.0000)
BM 0.0060*** 0.0038*** 0.0038*** 0.0038***
(0.0009) (0.0007) (0.0007) (0.0007)
INSTOWN 0.0767*** 0.0732*** 0.0732*** 0.0732***
(0.0063) (0.0064) (0.0064) (0.0064)
EARNGROWTH -0.0708 -0.0483 -0.0485 -0.0478
(0.0508) (0.0501) (0.0501) (0.0502)
Cash_holding -0.0761*** -0.0471*** -0.0472*** -0.0470***
(0.0082) (0.0086) (0.0086) (0.0086)
Leverage 0.0000 0.0001*** 0.0001*** 0.0001***
(0.0000) (0.0000) (0.0000) (0.0000)
Intercept 0.5208*** 0.7527*** 0.4484*** 0.4485*** 0.4426***
R-squared 0.0025 0.0185 0.0528 0.0528 0.0528
R-squared Adj. 0.0025 0.0184 0.0526 0.0526 0.0526

Herding Results

Back

The Herding results depicts whether female analyst tend to issue more herding forecast compared with male after the Covid-19.

Regression 1: gender_post term is positive and significant, means female analyst tend to issue more herding forecast compared with male after the Covid-19.

Regression 2: gender_post term is not significant, means we cannot find evidence that female tend to issue more herding forecasts.

Regression 3: gender_post term is not significant, means we cannot find evidence that female tend to issue more herding forecasts.

Regression 4: gender_post term is positive and significant, and gender_post_NY is negative and significant. This means New York based female analyst tend to issue less herding forecast, but other states located female analyst tend to issue more herding forecast. This results means New York based analyst is driven by some factors related with New York and we cannot argue that female analyst tend to issue more herding forecast compared with male.

Regression 5: gender_covid and gender_covid_NY following similar pattern with regression 4. This results still cannot support us argue that female analyst is issuing more herding forecast compared with male analyst.

In [19]:
Herding = Regression1(data, dep = 'Herding')
Herding
Out[19]:
Herding1 Herding2 Herding3 Herding4 Herding5
gender_post 0.0087*** 0.0033 0.0024 0.0111**
(0.0032) (0.0037) (0.0036) (0.0056)
gender_covid 0.0000**
(0.0000)
gender_post_NY -0.0119**
(0.0057)
gender_covid_NY -0.0000**
(0.0000)
gender 0.0185*** 0.0182*** 0.0184*** 0.0183*** 0.0194***
(0.0030) (0.0033) (0.0032) (0.0032) (0.0025)
post 0.0199*** 0.0089*** 0.0884*** 0.0883***
(0.0009) (0.0011) (0.0122) (0.0122)
avg_new_cases -0.0000***
(0.0000)
Forecast_Age -0.0706*** -0.0670*** -0.0670*** -0.0671***
(0.0007) (0.0007) (0.0007) (0.0007)
Top_Broker 0.0149*** 0.0134*** 0.0137*** 0.0139***
(0.0017) (0.0018) (0.0018) (0.0018)
EXPERIENCE -0.0003*** -0.0002** -0.0002** -0.0002**
(0.0001) (0.0001) (0.0001) (0.0001)
EXPWITHFIRM 0.0008*** 0.0009*** 0.0009*** 0.0009***
(0.0001) (0.0001) (0.0001) (0.0001)
COVERAGESIZE 0.0034*** 0.0022*** 0.0021*** 0.0022***
(0.0002) (0.0002) (0.0002) (0.0002)
COVERAGEFOCUS -0.0046*** -0.0012*** -0.0012*** -0.0012***
(0.0003) (0.0003) (0.0003) (0.0003)
size 0.0031*** 0.0032*** 0.0032*** 0.0032***
(0.0004) (0.0004) (0.0004) (0.0004)
ROA -0.0393 -0.0037 -0.0043 -0.0043
(0.0254) (0.0252) (0.0251) (0.0251)
RD -0.0364*** 0.0209 0.0211 0.0213
(0.0140) (0.0140) (0.0141) (0.0140)
Total_asset -0.0000 -0.0000*** -0.0000*** -0.0000***
(0.0000) (0.0000) (0.0000) (0.0000)
BM 0.0022*** 0.0008* 0.0008* 0.0008*
(0.0007) (0.0005) (0.0005) (0.0005)
INSTOWN 0.0245*** 0.0266*** 0.0266*** 0.0264***
(0.0028) (0.0029) (0.0029) (0.0029)
EARNGROWTH 0.0398 0.0190 0.0194 0.0194
(0.0259) (0.0247) (0.0246) (0.0247)
Cash_holding -0.0280*** -0.0044 -0.0041 -0.0040
(0.0039) (0.0040) (0.0040) (0.0040)
Leverage 0.0000 0.0000 0.0000 0.0000
(0.0000) (0.0000) (0.0000) (0.0000)
Intercept 0.1222*** 0.3980*** 0.1774*** 0.1773*** 0.1838***
R-squared 0.0013 0.0327 0.0421 0.0421 0.0421
R-squared Adj. 0.0013 0.0326 0.0419 0.0419 0.0419

Updating_Frequency Results

Back

The Updating_Frequency results depict whether female analyst tend to updating forecast more frequent compared with male after the Covid-19.

Regression 1: gender_post term is not significant, means we cannot find evidence that female analyst tend to updating forecast more frequent compared with male after the Covid-19.

Regression 2: gender_post term is not significant, means we cannot find evidence that female analyst tend to updating forecast more frequent compared with male after the Covid-19.

Regression 3: gender_post term is not significant, means we cannot find evidence that female analyst tend to updating forecast more frequent compared with male after the Covid-19.

Regression 4: gender_post term and gender_post_NY is not significant, means we cannot find evidence that female analyst tend to updating forecast more frequent compared with male after the Covid-19.

Regression 5: gender_covid and gender_covid_NY both significant but shows opposite sign. This means under Covid new cases measure across time and states, female tend to updating less outside of New York but New York based female analyst was increasing their updating frequency. This results shows heterogeneous effect within US but still cannot help us identify whether female's updating frequency was increasing or not.

In [20]:
Updating_Frequency = Regression1(data, dep = 'Updating_Frequency')
Updating_Frequency
Out[20]:
Updating_Frequency1 Updating_Frequency2 Updating_Frequency3 Updating_Frequency4 Updating_Frequency5
gender_post 0.2507 0.4453 -0.2062 -0.9965
(0.3916) (0.4219) (0.4171) (0.6206)
gender_covid -0.0004**
(0.0001)
gender_post_NY 1.0795
(0.7114)
gender_covid_NY 0.0004**
(0.0002)
gender -5.0981*** -6.4011*** -5.2499*** -5.2430*** -5.2906***
(0.5202) (0.5423) (0.4997) (0.5002) (0.5951)
post 4.0029*** 4.0255*** -8.9063 -8.8976
(0.2890) (0.2924) (11.5824) (11.5814)
avg_new_cases 0.0001
(0.0000)
Forecast_Age -2.5359*** -1.1931*** -1.1939*** -1.1899***
(0.1336) (0.0765) (0.0765) (0.0767)
Top_Broker 10.1380*** 9.8120*** 9.7810*** 9.7576***
(1.0381) (0.9754) (0.9790) (0.9829)
EXPERIENCE -0.1741*** -0.1245*** -0.1248*** -0.1251***
(0.0315) (0.0275) (0.0275) (0.0275)
EXPWITHFIRM -0.1818*** -0.2145*** -0.2142*** -0.2142***
(0.0510) (0.0469) (0.0469) (0.0468)
COVERAGESIZE 1.3212*** 0.7799*** 0.7802*** 0.7796***
(0.0613) (0.0587) (0.0587) (0.0587)
COVERAGEFOCUS -1.4448*** -0.0667 -0.0675 -0.0672
(0.0748) (0.1174) (0.1174) (0.1172)
size -0.4603** 0.0139 0.0136 0.0118
(0.1833) (0.1841) (0.1842) (0.1843)
ROA -7.0319** -0.3408 -0.2911 -0.2815
(3.2177) (3.6581) (3.6370) (3.6373)
RD -13.7366*** 6.2972** 6.2754** 6.2580**
(3.0646) (2.9472) (2.9461) (2.9503)
Total_asset 0.0000*** 0.0000** 0.0000** 0.0000**
(0.0000) (0.0000) (0.0000) (0.0000)
BM 0.7997*** -0.0027 -0.0026 -0.0020
(0.2246) (0.1549) (0.1549) (0.1549)
INSTOWN 4.2561*** 2.4765** 2.4753** 2.4870**
(1.2616) (1.2241) (1.2240) (1.2246)
EARNGROWTH 6.2923** 0.7919 0.7556 0.7497
(2.8112) (3.0221) (3.0089) (3.0021)
Cash_holding -11.2067*** -2.1785* -2.2044* -2.2187*
(1.2804) (1.2701) (1.2701) (1.2705)
Leverage -0.0052 -0.0029 -0.0029 -0.0029
(0.0055) (0.0055) (0.0055) (0.0055)
Intercept 31.1541*** 38.2719*** 13.1028*** 13.1136*** 12.5981***
R-squared 0.0044 0.0970 0.2035 0.2035 0.2035
R-squared Adj. 0.0044 0.0970 0.2033 0.2033 0.2033

Bold_d Results

Back

The Bold_d results depict whether female analyst tend to issue more Bold forecasts compared with male after the Covid-19.

Regression 1: gender_post term is not significant, means we cannot find evidence that female analyst tend to issue more Bold forecasts compared with male after the Covid-19.

Regression 2: gender_post term is not significant, means we cannot find evidence that female analyst tend to issue more Bold forecasts compared with male after the Covid-19.

Regression 3: gender_post term is not significant, means we cannot find evidence that female analyst tend to issue more Bold forecasts compared with male after the Covid-19.

Regression 4: gender_post term and gender_post_NY is not significant, means we cannot find evidence that female analyst tend to issue more Bold forecasts compared with male after the Covid-19.

Regression 5: gender_covid and gender_covid_NY both insignificant.

In [21]:
Bold_d = Regression1(data, dep = 'Bold_d')
Bold_d
Out[21]:
Bold_d1 Bold_d2 Bold_d3 Bold_d4 Bold_d5
gender_post 0.0015 -0.0026 -0.0062 0.0069
(0.0038) (0.0044) (0.0043) (0.0065)
gender_covid 0.0000
(0.0000)
gender_post_NY -0.0178***
(0.0067)
gender_covid_NY -0.0000
(0.0000)
gender 0.0292*** 0.0220*** 0.0217*** 0.0216*** 0.0142***
(0.0035) (0.0040) (0.0038) (0.0038) (0.0032)
post 0.0803*** 0.0483*** 0.2490*** 0.2488***
(0.0013) (0.0015) (0.0349) (0.0349)
avg_new_cases -0.0000**
(0.0000)
Forecast_Age -0.1859*** -0.1824*** -0.1824*** -0.1824***
(0.0013) (0.0012) (0.0012) (0.0012)
Top_Broker -0.0004 -0.0020 -0.0015 -0.0017
(0.0030) (0.0030) (0.0030) (0.0030)
EXPERIENCE 0.0002* 0.0003** 0.0003** 0.0003**
(0.0001) (0.0001) (0.0001) (0.0001)
EXPWITHFIRM 0.0029*** 0.0024*** 0.0024*** 0.0024***
(0.0002) (0.0002) (0.0002) (0.0002)
COVERAGESIZE 0.0004* 0.0012*** 0.0012*** 0.0012***
(0.0002) (0.0002) (0.0002) (0.0002)
COVERAGEFOCUS 0.0064*** 0.0045*** 0.0045*** 0.0045***
(0.0004) (0.0005) (0.0005) (0.0005)
size -0.0013* -0.0007 -0.0007 -0.0007
(0.0007) (0.0007) (0.0007) (0.0007)
ROA 0.0012 0.0050 0.0042 0.0053
(0.0507) (0.0465) (0.0463) (0.0466)
RD -0.2536*** -0.2370*** -0.2366*** -0.2364***
(0.0331) (0.0335) (0.0335) (0.0335)
Total_asset -0.0000** -0.0000*** -0.0000*** -0.0000***
(0.0000) (0.0000) (0.0000) (0.0000)
BM 0.0020*** 0.0005 0.0005 0.0005
(0.0007) (0.0006) (0.0006) (0.0006)
INSTOWN 0.0630*** 0.0525*** 0.0525*** 0.0523***
(0.0053) (0.0052) (0.0052) (0.0052)
EARNGROWTH 0.0202 0.0232 0.0238 0.0228
(0.0468) (0.0449) (0.0448) (0.0450)
Cash_holding -0.0520*** -0.0384*** -0.0380*** -0.0383***
(0.0063) (0.0065) (0.0065) (0.0065)
Leverage 0.0000 0.0000 0.0000 0.0000
(0.0000) (0.0000) (0.0000) (0.0000)
Intercept 0.3393*** 1.1311*** 0.8086*** 0.8085*** 0.8171***
R-squared 0.0067 0.1006 0.1225 0.1225 0.1224
R-squared Adj. 0.0067 0.1006 0.1223 0.1223 0.1222

Rounding Results

Back

The Rounding results depict whether female analyst tend to issue more Rounding forecasts compared with male after the Covid-19.

Regression 1: gender_post term is not significant, means we cannot find evidence that female tend to issue more or less rounding forecasts.

Regression 2: gender_post term is not significant, means we cannot find evidence that female tend to issue more or less rounding forecasts.

Regression 3: gender_post term is not significant, means we cannot find evidence that female tend to issue more or less rounding forecasts.

Regression 4: gender_post term and gender_post_NY are not significant, means we cannot find evidence that female tend to issue more or less rounding forecasts.

Regression 5: similar results with regression 4.

In [22]:
Rounding = Regression1(data, dep = 'Rounding')
Rounding
Out[22]:
Rounding1 Rounding2 Rounding3 Rounding4 Rounding5
gender_post -0.0011 -0.0020 -0.0026 0.0003
(0.0034) (0.0039) (0.0039) (0.0058)
gender_covid -0.0000
(0.0000)
gender_post_NY -0.0040
(0.0060)
gender_covid_NY -0.0000
(0.0000)
gender -0.0043 0.0029 0.0014 0.0014 0.0017
(0.0034) (0.0038) (0.0037) (0.0037) (0.0029)
post 0.0017 0.0004 0.0104 0.0103
(0.0011) (0.0012) (0.0376) (0.0376)
avg_new_cases 0.0000***
(0.0000)
Forecast_Age -0.0054*** -0.0054*** -0.0054*** -0.0054***
(0.0007) (0.0007) (0.0007) (0.0007)
Top_Broker -0.0046** -0.0052*** -0.0051*** -0.0055***
(0.0019) (0.0019) (0.0019) (0.0019)
EXPERIENCE 0.0006*** 0.0006*** 0.0006*** 0.0006***
(0.0001) (0.0001) (0.0001) (0.0001)
EXPWITHFIRM 0.0008*** 0.0006*** 0.0006*** 0.0006***
(0.0002) (0.0002) (0.0002) (0.0002)
COVERAGESIZE -0.0007*** -0.0006*** -0.0006*** -0.0006***
(0.0001) (0.0002) (0.0002) (0.0002)
COVERAGEFOCUS 0.0009*** 0.0004 0.0004 0.0004
(0.0003) (0.0003) (0.0003) (0.0003)
size -0.0042*** -0.0047*** -0.0047*** -0.0047***
(0.0004) (0.0004) (0.0004) (0.0004)
ROA -0.0051 -0.0006 -0.0008 -0.0005
(0.0296) (0.0295) (0.0295) (0.0295)
RD -0.0671*** -0.0533*** -0.0533*** -0.0537***
(0.0155) (0.0154) (0.0154) (0.0154)
Total_asset -0.0000* -0.0000 -0.0000 -0.0000
(0.0000) (0.0000) (0.0000) (0.0000)
BM -0.0011*** -0.0009*** -0.0009*** -0.0008***
(0.0003) (0.0003) (0.0003) (0.0003)
INSTOWN 0.0010 0.0006 0.0006 0.0007
(0.0029) (0.0029) (0.0029) (0.0029)
EARNGROWTH -0.0376** -0.0411*** -0.0410*** -0.0411***
(0.0158) (0.0149) (0.0149) (0.0149)
Cash_holding 0.0110*** 0.0134*** 0.0135*** 0.0133***
(0.0043) (0.0045) (0.0045) (0.0045)
Leverage -0.0000** -0.0000* -0.0000* -0.0000*
(0.0000) (0.0000) (0.0000) (0.0000)
Intercept 0.1741*** 0.2275*** 0.0519*** 0.0519*** 0.0452***
R-squared 0.0000 0.0012 0.0030 0.0030 0.0030
R-squared Adj. 0.0000 0.0012 0.0028 0.0028 0.0028

Reissue Results

Back

The Reissue results depict whether female analysts tend to revise their forecasts compared with the male after Covid-19.

Regression 1: gender_post term is not significant, means we cannot find evidence that female analysts tend to revise their forecasts compared with the male after Covid-19.

Regression 2: gender_post term is not significant, means we cannot find evidence that female analysts tend to revise their forecasts compared with the male after Covid-19.

Regression 3: gender_post term is not significant, means we cannot find evidence that female analysts tend to revise their forecasts compared with the male after Covid-19.

Regression 4: gender_post term and gender_post_NY are not significant, means we cannot find evidence that female analysts tend to revise their forecasts compared with the male after Covid-19.

Regression 5: similar results with regression 4.

In [23]:
Reissue = Regression1(data, dep = 'Reissue')
Reissue
Out[23]:
Reissue1 Reissue2 Reissue3 Reissue4 Reissue5
gender_post -0.0063 -0.0062 -0.0033 -0.0141
(0.0055) (0.0060) (0.0060) (0.0099)
gender_covid -0.0000
(0.0000)
gender_post_NY 0.0147
(0.0107)
gender_covid_NY 0.0000
(0.0000)
gender 0.0234*** 0.0307*** 0.0325*** 0.0326*** 0.0316***
(0.0051) (0.0056) (0.0056) (0.0056) (0.0048)
post -0.0012 -0.0043** -0.0609 -0.0608
(0.0019) (0.0021) (0.0543) (0.0543)
avg_new_cases 0.0000**
(0.0000)
Forecast_Age 0.0795*** 0.0729*** 0.0729*** 0.0730***
(0.0011) (0.0010) (0.0010) (0.0010)
Top_Broker 0.0001 0.0018 0.0014 0.0014
(0.0031) (0.0031) (0.0031) (0.0031)
EXPERIENCE 0.0006*** 0.0005*** 0.0005*** 0.0005***
(0.0002) (0.0002) (0.0002) (0.0002)
EXPWITHFIRM -0.0018*** -0.0012*** -0.0012*** -0.0012***
(0.0003) (0.0003) (0.0003) (0.0003)
COVERAGESIZE 0.0000 -0.0002 -0.0002 -0.0002
(0.0003) (0.0003) (0.0003) (0.0003)
COVERAGEFOCUS 0.0026*** 0.0018*** 0.0018*** 0.0018***
(0.0006) (0.0006) (0.0006) (0.0006)
size -0.0128*** -0.0106*** -0.0106*** -0.0106***
(0.0008) (0.0008) (0.0008) (0.0008)
ROA 0.1537** 0.0968* 0.0975* 0.0971*
(0.0639) (0.0535) (0.0534) (0.0535)
RD 0.0785*** 0.0102 0.0099 0.0096
(0.0290) (0.0284) (0.0284) (0.0284)
Total_asset -0.0000*** -0.0000*** -0.0000*** -0.0000***
(0.0000) (0.0000) (0.0000) (0.0000)
BM 0.0007 0.0008 0.0008 0.0008
(0.0006) (0.0006) (0.0006) (0.0006)
INSTOWN 0.0058 0.0096* 0.0096* 0.0098*
(0.0052) (0.0054) (0.0054) (0.0054)
EARNGROWTH -0.1414*** -0.1009** -0.1014** -0.1010**
(0.0475) (0.0413) (0.0413) (0.0413)
Cash_holding -0.0041 -0.0163** -0.0167** -0.0166**
(0.0078) (0.0083) (0.0083) (0.0083)
Leverage 0.0001*** 0.0001*** 0.0001*** 0.0001***
(0.0000) (0.0000) (0.0000) (0.0000)
Intercept 0.6364*** 0.3678*** 0.2305*** 0.2306*** 0.2219***
R-squared 0.0001 0.0229 0.0301 0.0301 0.0301
R-squared Adj. 0.0001 0.0228 0.0298 0.0299 0.0299

Overall, after I do the linear regression analysis for pessimism, updating frequency, rounding, herding, bold_d, and reissue. I didn't find there exist robust evidence that female analyst behaves differently compared with male analyst after the covid-19 outbreak. The female analyst was doing the same quality job compared with the male analyst after covid-19. However, whether the stock market detects some information from the noisy forecast issued by a female? The following test will help us find whether stock market investors detect some information that female analyst conveys to them.

II. Stock Market Reaction Gender Effect

Back

To test whether investors detect information differently from female analysts compared with that male analysts. I will perform the following analysis.

I will use the post-earning announcement 10 days buy and hold return to proxy the investor's reaction. I will use the KNN regression model to predict the post-earning announcement 10 days buy and hold return by using pre_earning announcement characteristics, including analyst number, analyst gender ratio, average analyst forecast behavior variables for each earning announcement... etc. I will perform this KNN prediction model with analyst gender ratio and without analyst gender ratio to check whether the absolute prediction error is decreased or not.

Let's first load the 10 days return data and merge it with my current analyst forecast datasets. The earning announcement post 10 days return data is from the IBES database.

In [24]:
ret = pd.read_csv("return.csv")
ret['repdats'] = pd.to_datetime(ret['repdats'])
data['repdats'] = pd.to_datetime(data['repdats'])
data2 = pd.merge(data, ret[['AMASKCD','ANNDATS', 'permno','repdats', 'bhar0_10', 'bhar0_5']], how = "left", on = ['AMASKCD','ANNDATS','permno', 'repdats'])
data2 = data2[['AMASKCD', 'ANALYST', 'ESTIMID', 'TICKER', 'ESTIMATOR', 'ANALYS',
       'VALUE', 'FPEDATS', 'REVDATS', 'REVTIMS', 'ANNDATS', 'ANNTIMS',
       'permno', 'basis', 'repdats', 'act', 'new_value', 'Top_Broker', 'fqtr',
       'EXPERIENCE', 'EXPWITHFIRM', 'size', 'Leverage', 'ROA', 'Cash_holding',
       'RD', 'Total_asset', 'BM', 'INSTOWN', 'accrual', 'EARNGROWTH',
       'ANNDATS2', 'REVDATS2', 'repdats2', 'Year_Quarter', 'Year_Month',
       'SIC4', 'gender', 'city', 'state_code', 'Zipcode', 'cases', 'deaths',
       'new_cases', 'avg_new_cases', 'avg_new_cases_r', 'rnewcase_pos_d',
       'rnewcase_neg_d', 'shift_new_value', 'acc_avg180', 'acc_std180',
       'Pessimism', 'Firm_Covered', 'COVERAGESIZE', 'Forecast_Number',
       'Updating_Frequency', 'SIC Covered', 'COVERAGEFOCUS', 'Forecast Issued',
       'FORECASTFREQ_LAG', 'Herding', 'Bold_pos', 'Bold_neg', 'Bold_d',
       'Rounding', 'Reissue', 'days', 'Forecast_Age', 'ANNDATS2_lag',
       'new_value_lag', 'Rec_Chg', 'fqtrgvkey5', 'AMASKCD_permno', 'post',
       'NY', 'gender_post_NY', 'gender_post', 'gender_covid_NY',
       'gender_covid', 'SIC2', 'bhar0_10', 'bhar0_5']].drop_duplicates()

The next step is to restructure the analyst forecast data into earning announcement data. The difference between analyst forecast data and earning announcement data is analyst forecast data each observation is analyst forecast, but in the earning announcement data, each observation is earning announcement. I will perform the groupby method on analyst forecast data based on their forecasted earning announcement and average the analyst forecasts behavior variables for each earning announcement.

In [25]:
# Mean
announce_mean = data2[['repdats2','act', 'new_value',
       'Top_Broker', 'EXPERIENCE', 'EXPWITHFIRM', 'size', 'Leverage',
       'ROA', 'Cash_holding', 'RD', 'Total_asset', 'BM', 'INSTOWN', 'accrual',
       'EARNGROWTH', 'gender', 'avg_new_cases', 'rnewcase_pos_d', 'rnewcase_neg_d',
       'Pessimism', 'Firm_Covered', 'COVERAGESIZE','Forecast_Number', 'Updating_Frequency',
       'SIC Covered', 'COVERAGEFOCUS','FORECASTFREQ_LAG', 'Herding', 'Bold_pos',
       'Bold_neg', 'Bold_d', 'Rounding', 'Reissue', 'Forecast_Age','Rec_Chg', 'fqtrgvkey5',
       'bhar0_10', 'bhar0_5']].groupby('fqtrgvkey5').mean().reset_index()

# Count
announce_count = data2[['AMASKCD', 'fqtrgvkey5']].groupby('fqtrgvkey5').count().reset_index()
ann = pd.merge(announce_mean, announce_count, how = 'left', on = 'fqtrgvkey5').drop_duplicates()
ann = ann[(ann['bhar0_10'].isna() == False) & (ann['bhar0_5'].isna() == False)].dropna()

# Combine the sample.
ann.index = range(len(ann))
ann['Year_Month'] = np.int32(ann['repdats2']/100)
ann = pd.merge(ann, data2[['fqtrgvkey5', 'repdats']].drop_duplicates('fqtrgvkey5'), how = 'left', on = 'fqtrgvkey5')
In [26]:
ann.columns
ann[['new_value', 'Top_Broker','gender','Pessimism',
    'Firm_Covered',  'Forecast_Number', 
    'Updating_Frequency','SIC Covered',  'Herding',
    'Bold_d', 'Rounding', 'Reissue', 
    'Rec_Chg', 'bhar0_10']].describe().T
Out[26]:
count mean std min 25% 50% 75% max
new_value 34064.0 0.404965 0.965048 -2.170000 -0.096250 0.197074 0.720000 5.030000
Top_Broker 34064.0 0.190269 0.224243 0.000000 0.000000 0.130032 0.333333 1.000000
gender 34064.0 0.085456 0.160618 0.000000 0.000000 0.000000 0.125000 1.000000
Pessimism 34064.0 0.430733 0.260102 0.000000 0.250000 0.466667 0.647059 0.975610
Firm_Covered 34064.0 11.645346 4.720616 1.000000 8.666667 11.153846 14.042120 51.000000
Forecast_Number 34064.0 21.243844 8.729691 1.000000 15.564939 20.493590 25.928571 89.000000
Updating_Frequency 34064.0 25.127523 13.691013 1.000000 17.000000 22.769231 30.000000 199.423077
SIC Covered 34064.0 6.180787 2.406762 1.000000 4.400000 6.000000 7.679434 20.000000
Herding 34064.0 0.099764 0.108462 0.000000 0.000000 0.076923 0.173913 0.600000
Bold_d 34064.0 0.344650 0.190835 0.000000 0.230769 0.375000 0.500000 0.875000
Rounding 34064.0 0.182533 0.176999 0.000000 0.035714 0.162791 0.250000 1.000000
Reissue 34064.0 0.655673 0.234641 0.000000 0.500000 0.666667 0.812500 1.000000
Rec_Chg 34064.0 -0.055574 0.311434 -0.875000 -0.300000 0.000000 0.153846 0.857143
bhar0_10 34064.0 -0.000205 0.134695 -1.485122 -0.049557 -0.006954 0.033257 5.604665
  • new_value: average forecast value for this earning announcement.

  • Top_Broker: The proportion of top broker analyst involved in this earning announcement.

  • gender: The female analyst issued forecasts proportion of this earning announcement.

  • Pessimism: The pessimism forecast proportion for this earning announcement.

  • Firm_Covered: Average covered firms of analyst involved in this earning announcement.

  • Forecast_Number: How many forecasts issued in this earning announcement.

  • Updating_Frequency: avergae updating frequent in this earning announcement.

  • SIC Covered: Average industry covered of analyst involved in this earning announcement.

  • Herding: the herding forecast proportion in this earning announcement.

  • Bold_d: the proportion of boldness forecast in this earning announcement.

  • Rounding: the proportion of rounding forecast in this earning announcement.

  • Reissue: the proportion of reissued forecast in this earning announcement.

  • Rec_Chg: the average pessimistic score for this earning announcement.

  • bhar0_10: the 10 days post earning announcement return.

We can observe that there are 34064 observations in our sample. The next step is to carry out the KNN regression model with gender ratio and without gender ratio analysis.

I. KNN fit with Gender

Back

The following analysis reports the absolute error for the KNN regression prediction model with gender ratio. Besides the gender ratio, I will also include other variables, including new_value, Top_Broker, Pessimism, Firm_Covered, Forecast_Number, Updating_Frequency, SIC Covered, Herding, Bold_d, Rounding, Reissue, and Rec_Chg. I will train the model based on the past 30 days earning announcement data. After I train the model, I will use the current earning announcement input variable data to make predictions about its 10-day earning announcement return and calculate the absolute error based on the difference between the prediction and true value.

In [27]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from datetime import datetime, timedelta

def KNN_fit(neighbor, train_data, predict_data, feature):
    # Define the training data.
    # Represent the features as a list of dicts.    
    X_train_dict = train_data[feature].to_dict(orient="records")
    X_new_dict = predict_data[feature].to_dict()

    y_train = train_data["bhar0_10"]

    # Dummy encoding
    vec = DictVectorizer(sparse=False)
    vec.fit(X_train_dict)
    X_train = vec.transform(X_train_dict)
    X_new = vec.transform(X_new_dict)

    # Standardization
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_sc = scaler.transform(X_train)
    X_new_sc = scaler.transform(X_new)

    # K-Nearest Neighbors Model
    model = KNeighborsRegressor(n_neighbors=neighbor)
    model.fit(X_train_sc, y_train)
    predict = model.predict(X_new_sc)
    True_val = predict_data["bhar0_10"]
    return predict, True_val
In [28]:
pre1 = []
tr1 = []
ti1 = []
feature = ['new_value', 'Top_Broker','gender','Pessimism',
    'Firm_Covered',  'Forecast_Number', 'Updating_Frequency',
    'SIC Covered',  'Herding','Bold_d', 'Rounding', 'Reissue', 
    'Rec_Chg']
for start in range(100, len(ann)-100):
    end = ann.loc[start, 'repdats'] - timedelta(days=30)
    start2 = ann.loc[start, 'repdats']
    ann_in = ann[(ann['repdats']<start2)&(ann['repdats']>=end)]
    predict = ann.loc[start, :]
    predict, true = KNN_fit(neighbor = 20, train_data = ann_in, predict_data = predict,feature = feature)
    pre1.append(predict)
    tr1.append(true)
    ti1.append(ann.loc[start+1, 'repdats'])
    
    if start%3000 == 0:
        print('Progress: ', np.round(start/len(ann)*100, 3), '%')
matrix = pd.DataFrame(pre1, columns = ['predict'])
matrix['true'] = tr1
matrix['time'] = ti1
matrix['abs error'] = np.abs(matrix['predict'] - matrix['true'])
print('Absolute Error: ', matrix['abs error'].mean())
se.set(rc={'figure.figsize':(25,8)})
se.lineplot(data = matrix, x = matrix.time, y = matrix['abs error'])
plt.axvline('2020-03-01', 0,1, color = 'red')
Progress:  8.807 %
Progress:  17.614 %
Progress:  26.421 %
Progress:  35.228 %
Progress:  44.035 %
Progress:  52.842 %
Progress:  61.649 %
Progress:  70.456 %
Progress:  79.263 %
Progress:  88.07 %
Progress:  96.876 %
Absolute Error:  0.0702645217268277
Out[28]:
<matplotlib.lines.Line2D at 0x1210db3ff40>

The above graph is the absolute error that I trained based on the KNN model. The red line is covid-19 breakout time in the US. We could see the trend is quite steady and didn't have a significantly different pattern before and after the Covid 19.

The overall absolute error is around 0.0702, I will recheck this value with KNN regression without including gender.

II. KNN fit without Gender

Back

The following analysis reports the absolute error for the KNN regression prediction model without gender ratio.

In [29]:
pre2 = []
tr2 = []
ti2 = []
feature = ['new_value', 'Top_Broker','Pessimism',
    'Firm_Covered',  'Forecast_Number', 
    'Updating_Frequency','SIC Covered',  'Herding',
     'Bold_d', 'Rounding', 'Reissue', 
    'Rec_Chg']
for start in range(100, len(ann)-100):
    end = ann.loc[start, 'repdats'] - timedelta(days=30)
    start2 = ann.loc[start, 'repdats']
    ann_in = ann[(ann['repdats']<start2)&(ann['repdats']>=end)]
    predict = ann.loc[start, :]
    predict, true = KNN_fit(neighbor = 20, train_data = ann_in, predict_data = predict,feature = feature)
    pre2.append(predict)
    tr2.append(true)
    ti2.append(ann.loc[start+1, 'repdats'])
    
    if start%3000 == 0:
        print('Progress: ', np.round(start/len(ann)*100, 3), '%')
matrix2 = pd.DataFrame(pre2, columns = ['predict'])
matrix2['true'] = tr2
matrix2['time'] = ti2
matrix2['abs error'] = np.abs(matrix2['predict'] - matrix2['true'])
print('Absolute Error: ', matrix2['abs error'].mean())
se.set(rc={'figure.figsize':(25,8)})
se.lineplot(data = matrix2, x = matrix2.time, y = matrix2['abs error'])
plt.axvline('2020-03-01', 0,1, color = 'red')
Progress:  8.807 %
Progress:  17.614 %
Progress:  26.421 %
Progress:  35.228 %
Progress:  44.035 %
Progress:  52.842 %
Progress:  61.649 %
Progress:  70.456 %
Progress:  79.263 %
Progress:  88.07 %
Progress:  96.876 %
Absolute Error:  0.07015347010086732
Out[29]:
<matplotlib.lines.Line2D at 0x121111680a0>

The above graph is the absolute error that I trained based on the KNN model without gender. The red line is covid-19 breakout time in the US. We could see the trend is quite steady and didn't have a significantly different pattern before and after the Covid 19.

The overall absolute error is around 0.07015, compared with the includes gender absolute error of 0.0702, the difference is too small to consider. Therefore, I conclude that there is no difference in absolute error between without control gender variable and with control gender variable in the 10 days post earning announcement return KNN model. The results indicate that investors didn't perceive the difference in forecasts issued by females and males both before and after the covid-19.

Conclusion

Back

Do family and male work quality impact differently during the Covid-19 period? This research leverage the financial analyst forecast data to test whether female financial analyst is issuing different forecasts compared with the male after the covid-19. My first hypothesis is whether female analysts behave differently compared with males based on their issued forecast after the covid-19. My second hypothesis does investors pick up information differently based on analysts' genders. I use the IBES dataset and New York Times Covid-19 data to test my hypothesis. In summary, I found:

  • Female analyst is not behaved significantly differently compared with male analysts after the Covid-19.

  • Investors didn't pay attention to different forecasts issued by different genders.